ABSTRACT

This paper examines the validity of the concept of 
linguistic units in a theory of speech production. Substantiating 
data are drawn from the study of the speech production process 
itself. Secondarily, an attempt is made to reconcile the postulation
of linguistic units in speech production theory with their apparent 
absence in the speech signal. (Author/DD) 

The first purpose of this paper is to review some kinds of
evidence for what I will call the reality status of concepts of 
linguistic units in a theory of speech production. By this I mean 
I will review some evidence derived primarily from the study of 
the speech production process itself, that suggests that the 
speaker manipulates linguistic units as units when he produces 
an utterance and some evidence that suggests how he does it. The 
question of the reality status of units in speech production theory 
needs to be asked for two reasons. First, it is not at all neces- 
sary, a priori, that a postulated linguistic unit of any particular 
size or type, that has arisen from study of anything other than 
actual speech production itself is entitled to reality status in 
the production process. Second, even when considering speech pro- 
duction itself, it is very difficult to find direct evidence for 
linguistic units because speech is a continually variable output 
that does not readily lend itself to segmentation, as many
phoneticians and engineers have found in the past few years. This
lack of direct availability of linguistic units in speech output
gives rise to the second purpose of this paper, namely to take some
steps towards reconciling the postulation of linguistic units in
speech production theory, with their apparent absence in the 
speech signal. 

*Invited paper presented at the 85th Meeting of the Acoustical 
Society of America, Boston, Massachusetts, April 13, 1973.

One of the problems of studying linguistic units in the speech
production process is that we don't have as much experimental con- 
trol over a speaker's behavior as we do in speech perception studies, 
where we can manipulate variables related to linguistic units and 
observe the listener's response to them. This is one reason why 
we have become heavily dependent on errors in spontaneous output 
of speakers in our study of production units. I now want to
consider some of the information provided by speech errors about
linguistic units in speech production. I regret to say that most
of the data and interpretation I give here is not from my own 
work but mainly from two important papers by MacKay (1969) and 
Fromkin (1971). I regret it because ever since I read Lashley's
well-known paper on serial order (Lashley, 1951) I have been
interested in the contribution that speech errors could make to
models of speech production. However, the only sizeable corpus 
of speech errors that I could discover at that time was a
compendium of radio and TV blunders, some of which are quite
salacious, entitled "Pardon My Blooper." Although I did write a
paper analyzing these errors, I never tried to get it published as
I felt I could not find an editor whose estimation of the paper's
scientific merit exceeded his sense of propriety. (I suppose I
should at least have given it a chance to become an underground
classic.)
Instead I wrote a paper on typing errors, the data for which was 
readily available from students' lab reports (MacNeilage, 1964).
Unlike Fromkin, I did not have the insight to realize that speech 
errors occur with sufficient frequency in everyday situations to 
enable compilation of a large corpus in a short time. 

Without further ado, here is an analysis of some of the
implications of speech errors for linguistic unit concepts in
speech production theory.

Practically every unit of speech that has been postulated 
by linguists or others seems to have reality status in the speech 
production process in that it can be substituted, transposed,
omitted, or added as a unit in an otherwise correct sequence. 
The distinctive feature, the phoneme, the syllable, the morpheme 
and the word all have a claim to reality in this sense. 

Speech errors also serve to give information about the inven- 
tory of members of any given class of units. For instance, with 
respect to segments or phonemes, one does not observe diphthongs 
or affricates splitting into two components, one of which parti- 
cipates in an ordering error, thus suggesting that they are single 
phonemic units rather than closely associated pairs of units. On 
the other hand, individual consonants in a consonant cluster some- 
times act independently in an ordering error, suggesting that 
clusters are best regarded as groupings of individual phonemes. 

Similarly, postulated distinctive features can be evaluated 
by noting whether or not their physical correlates are independently 
variable in an otherwise stable sequence of speech output. In this 
regard a relatively small number of features such as nasality, 
voicing and place of articulation, which obviously do have
independent control possibilities, do sometimes seem to be
independently variable. For example, in "Cedars of Lemadon" for
"Cedars of Lebanon" (Fromkin, 1971) the nasality feature varies
independently. On the other hand, more abstract features such as
coronal, which are not defined in such a way as to allow
independent control, do not appear to be plausible explanations
for errors.

Larger units also participate in speech errors as units. 
Examples are:

a. Words: "A lot of bridge has passed under the water since
   then."

b. Syllables: "opacity and specifity" for "opacity and
   specificity" (Fromkin, 1971).

Another sense in which units larger than the distinctive 
feature and phoneme can be said to have reality is that they place 
strong constraints on the positional privileges governing phoneme 
or feature errors. Prevocalic consonants, vocalic nuclei (vowels),
and post-vocalic consonants rarely occur erroneously other than 
in the same syllable positions that they originated in. 

A schematic view of the speech production process incorporating 
these facts as well as a number of other implications of speech 
errors is shown in the first slide.¹ The box at the top requires
little discussion. It recognizes the necessity of an initial
idea, plan, or intention which is preverbal and is typically,
though not always, "satisfied" by production of a particular
sentence.

¹This figure comes from MacNeilage and MacNeilage, 1973.

The intention must include some semantic information to
be made more specific in the formation of a sentence. The next
step may be to decide on the general syntactic form of the
utterance, at least to a point sufficient to allow the generation
of the overall form of the sentence intonation contour. For
example, a choice may be made of "I'll have a steak" and not "A
steak for me," so that the main sentence stress is assigned to the
last word in the sentence. Following this, there may be two
parallel operations, the generation of the sentence intonation
contour, and the selection of appropriate lexical items with their
associated stress patterns. Some characteristics of the lexical
selection process are indicated by speech errors reported by
Fromkin. One possibility is a "semantic" error, e.g. "hate" for
"like", which appears to involve selection of the wrong value for a
semantic feature and a following lookup at a wrong lexical address
(Fromkin, 1971). A second class of error, e.g. "pressure" for
"present", appears to involve selection of a wrong lexical item
phonetically "near" the correct one in the lexical storage system
(Fromkin, 1971).

I have to confess I don't know very much about the stages of 
production I have been talking about so far and no doubt this 
part of the model could be considerably improved. I have included 
it largely to get us to the next stage where speech error data
begin to provide some more obvious constraints on the form of the
model, from my point of view. It is necessary to have a buffer,
or temporary storage stage, in which a number of lexical items
and a sentence intonation contour can coexist, for a number of
reasons. First, it is necessary to have a number of words
available simultaneously to account for transpositions of words,
as Lashley initially pointed out. In one of Fromkin's examples
"a computer in our own laboratory" becomes "a laboratory in our
own computer." Second, it is necessary for these transpositions to
take place before the assignment of the sentence intonation contour 
as evidenced by the fact that whereas in the example I just cited 
the correct version would have a major sentence stress on the 
first syllable of "laboratory," the transposed version received 
major sentence stress on the second syllable of "computer". 
Blends may be accounted for by a selection of parts of two lexical 
items which were simultaneously available, in temporary store, 
probably because a definite choice could not be made between them 
at the previous lexical item assignment stage to fill one lexical 
slot. The blend "grisp" which was derived from "grip" and "grasp" 
is an example of this. The items in this store must have at least 
the syllabic position of the main lexical stress already assigned, 
otherwise it would not be possible for the sentence stress to fall 
on the appropriate most highly stressed syllable of the lexical 
item, as we observed in "computer" where stress is on the second 
syllable even though the stress for laboratory would have been on 
the first. For stress to be assigned, syllable structure must
also have been specified prior to that point, as the syllable is 
the domain of stress. At this point, although lexical items are 
specified as units, there remains some "fluidity" in the linkages 
within lexical items, such that syllables, phonemes, and distinctive 
features are in some sense separately available for selection, 
thus allowing the transposition of syllables, phonemes, and dis- 
tinctive features which takes place in speech errors. 

The next step may be the serially ordered removal of items
from the buffer by a scanning mechanism, as proposed by Lashley. 
This mechanism probably imposes a particular speaking rate on the 
output. This may result in the transfer of the items to a stage 
of morphophonemic and phonological monitoring. We know that the 
scanning mechanism is susceptible to being "confused" in its 
serial selection of simultaneously available material both by 
stress values assigned to syllables and by segmental properties, 
and these variables seem to be the main source of serial ordering 
errors at the phonetic level (MacKay, 1970). Segment or feature 
reversals typically involve similar segments and the reversed pair 
is often preceded or followed by an identical phoneme which seems 
to have a "triggering" role. Components of stressed syllables 
are especially likely to participate in reversals, which suggests 
that the components being advanced in the order are in some way 
especially salient to the scanning mechanism. One reason for 
postulating a buffer store of finite size is to account for the 
inclusion of the delayed component in the spoonerism. If it is 
in a buffer, it would presumably remain available for selection 
by the scanner when it comes to the point when it requires the
unit that has already been advanced. 
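
The role of the buffer in an exchange error can be made concrete
with a small sketch. The Python below is only an illustration under
stated assumptions: the function name scan, the anticipate
parameter, and the rule that the scanner falls back on the earliest
unit still left in the buffer are all hypothetical, and none of
them is taken from the model shown in the slide.

```python
def scan(buffer, anticipate=None):
    """Read units out of a buffer in serial order.

    `anticipate` is an optional (slot, source) pair, a hypothetical way of
    injecting one scanner error: at position `slot` the scanner selects the
    unit that was intended for the later position `source`.
    """
    # Every unit stays in the buffer, tagged with its intended slot,
    # until the scanner removes it.
    remaining = list(enumerate(buffer))
    output = []
    for slot in range(len(buffer)):
        wanted = slot
        if anticipate is not None and slot == anticipate[0]:
            wanted = anticipate[1]  # the scanner is "confused" here
        # Take the wanted unit if it is still in the buffer. If it has
        # already been advanced, fall back on the earliest leftover unit,
        # which is the delayed component of the exchange.
        index = next(
            (k for k, (s, _) in enumerate(remaining) if s == wanted), 0
        )
        output.append(remaining.pop(index)[1])
    return output


phrase = ["a", "computer", "in", "our", "own", "laboratory"]
exchanged = scan(phrase, anticipate=(1, 5))
print(" ".join(exchanged))
```

With a single anticipation injected at the second position, the
delayed word is still in the buffer when the scanner reaches the
final position, so the sketch completes the exchange instead of
dropping or repeating a word.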

It is necessary to postulate the morphophonemic and phonological
monitoring stage following the serialization produced by the
scanning mechanism, because there are available numerous instances
that suggest that after a transposition has occurred, morphophonemic
and phonotactic rules renormalize the sequence. It is very rare
that even an erroneous output sequence violates either of these
two types of rules. As an example of morphophonemic rule operation,
consider Fromkin's example of "a kice ream cone" for "an ice cream 
cone," where the form of the indefinite article in the erroneous 
phrase is changed, obviously to fit the form of the new initial
segment of the next word. 

In Fromkin's example of "flay the pictor" for "play the vic- 
tor," a phonotactic rule apparently devoices the transposed /v/ 
(giving /f/) after the transposition, as /vl/ is not a permissible 
sequence in English. 
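
The two renormalizations just described can be sketched as a pair
of rewrite rules applied after serial ordering. The Python below is
a toy illustration only: it works on spelling instead of phonemic
transcription, and the names fix_article, fix_onset and monitor are
hypothetical labels, not terms from Fromkin's analysis.

```python
VOWEL_LETTERS = set("aeiou")  # crude spelling-based stand-in for vowel sounds


def fix_article(words):
    """Choose "a" or "an" to fit the first segment of the following word."""
    fixed = list(words)
    for k in range(len(fixed) - 1):
        if fixed[k] in ("a", "an"):
            next_starts_with_vowel = fixed[k + 1][0] in VOWEL_LETTERS
            fixed[k] = "an" if next_starts_with_vowel else "a"
    return fixed


def fix_onset(word):
    """Devoice an initial v before l, since /vl/ is not a permitted onset."""
    return "f" + word[1:] if word.startswith("vl") else word


def monitor(words):
    """Apply the morphophonemic rule, then the phonotactic rule."""
    return [fix_onset(w) for w in fix_article(words)]


# Inputs are the sequences as they would stand after the ordering error
# but before any renormalization.
print(monitor(["an", "kice", "ream", "cone"]))
print(monitor(["vlay", "the", "pictor"]))
```

The point of the sketch is the ordering of operations: the rules
see the already transposed sequence, which is why the erroneous
output still obeys them.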

The last three stages in the model will be discussed later 
in the paper. All ordering errors are deemed to occur before the 
operation of the target specification mechanism. They certainly 
occur before the motor control mechanism, because it apparently 
operates efficiently to produce the movements appropriate to the 
sequencing of the units, showing that the sequencing, although 
wrong, is fully specified before the motor control mechanism is 
activated. The point here is that the specification of the appro- 
priate movements for a segment must follow the ordering decision, 
otherwise the production, for a segment, of movements appropriate 
to its old context, would seldom result in its acceptable produc- 
tion. 

This is obviously a grossly oversimplified model of the speech 
production process. For example, it completely omits certain 
questions such as the status of affixes and the question of 
whether the segmental consequences of lexical stress are assigned by 
rule or stored. I present it only in order to give some
approximation to the stages involved in the speech production
process and to summarize some of the error types and arguments
that have
been put forth for the reality status of the various units in 
speech production theory. 

The stages of the mechanism discussed so far (that is down 
to the morphological and phonological monitor) serve the purpose 
of providing a subset of linguistic forms with an assigned serial 
order for an utterance whether right or wrong. The final stage 
of speech production which I will call the peripheral stage is 
the conversion of this output into a patterned, time-extended
acoustic signal. Perhaps the main thing we have learned about 
this stage in the last ten years or so is that it is not a good 
place to look for evidence about the identity and nature of
linguistic units. Much of recent phonetic history is a rather
melancholy progression of failures to find peripheral correlates
of linguistic units. Casualties have included the syllabic chest
pulse (Stetson, 1951; Ladefoged, Draper, and Whitteridge, 1958),
invariant motor correlates of the phoneme (Liberman, Cooper,
Harris, MacNeilage and Studdert-Kennedy, 1967; MacNeilage, 1970),
the archetypal breath group (Lieberman, 1967; Ohala, 1970) and
most recently, coarticulatory marking of syllable boundaries
(MacNeilage, 1972). Perhaps all of these related blind alleys
needed to be explored. And perhaps there will be more of them,
reflecting our bondage to the a priori necessity of thinking about
language and
speech in terms of linguistic units. But one thing I think our 
failure should tell us is that it may be time to consider the 
peripheral stage primarily in terms of its own dynamic properties 
rather than in terms of abstract linguistic units. The search
for invariants has told us at least that linguistic units do not
reach output without considerable modification. What is the nature 
of this modification? Hockett (1955) has suggested a model, and 
it is the appropriate time of the year to show the input to his 
model on the next slide. This is a set of variously colored but 
not boiled Easter eggs, representing segments which are the input 
to Hockett's model. According to Hockett the output modification
of these segments is analogous to forcing them through a wringer
in the correct order. Some may favor this as a working hypothesis. 
But the wringer did not even survive in washing machine technology 
so it would be embarrassing if it had to carry a heavy theoretical 
burden in phonetics (not to mention the problem of the speaker 
having egg on his face). Although it is not yet possible to
describe the peripheral stage of the production process accurately
with an analogy comparable in vividness and mnemonic value to
Hockett's, I would like to suggest an alternative view of segmental
aspects of the process that at least incorporates more of what we
know about the peripheral stage than Hockett's. The central
question we are concerned with is: Given that there is a discrete
linguistic input to the mechanism of speech production at some
stage, and given that the mechanism that transmits this input
is incapable of discrete units of output, what is the nature of
the transformation, at the peripheral stage, of one form to the
other?

The main reason that there is no simple 1:1 correspondence
between segments or features and speech signals is that
articulators, on whose position the segmental or feature
information depends, move relatively slowly from one required
quasi-stationary state to another, so that segment-related signals
are always contingent on their immediate segmental contexts.
Furthermore, we know
that this immediate context effect can stretch at least as far 
as four segments in each "direction" (MacNeilage, 1972). This 
context dependence is one of the two most central facts about the 
lack of invariant peripheral correlates of the segment or the
features, and the principles by which it is controlled must be
taken into account in any satisfactory speech production theory.
This is the main reason why a theory based on invariant motor
commands for segment types or features is unsatisfactory; namely,
because it postulates the inverse of what is typically observed.
The
second central fact about the status of segments at the periphery 
is that despite this context dependence, articulators typically 
approximate a single quasi-stationary state for a given segment,
regardless of segmental contexts, at least in careful speech. 
This fact has given rise to a number of theories that what is 
invariant in the peripheral stage of segment production is the 
specification of targets or configurational goals towards which
the articulators strive (MacNeilage, 1970). A target theory thus 
has the advantage of preserving an invariant segment or feature- 
related input to the model while not being inconsistent with the 
variance in the output. I also think that targets have reality 
status as means of interfacing the invariant linguistic unit 
level with the context dependent level in a way that phonemes
and features don't. That is, they are consistent with what we 
know about how the brain works (MacNeilage, 1970). Unfortunately,
like phonemes and features, targets are not directly observable.
Even in careful speech, articulators do not seem to reach exactly
the same place for a segment regardless of context. (They are
typically characterized by undershoot.) In most target or
configurational theories, this undershoot is attributed to the
sluggishness of the peripheral mechanical system and even to the
inability to deliver neural signals rapidly enough. At best, this
is an oversimplified explanation. That this is so can be seen
from an examination of some aspects of segmental dynamics. One
fact that has been known for a long time is that when speaking 
rate is increased the duration of vowels decreases proportionately 
more than consonants (Chistovich et al., 1965). It seems to me
that, according to a simple neuromechanical inertia model,
consonants should decrease in terms of articulation time as much as
vowels. But they don't, and I wish to suggest that the reason 
that they don't is that if they did, it would result in a much 
greater decrease in the amount of segmental acoustic information 
available from consonants than from vowels. It is known that 
consonants carry a greater information load at the segmental level 
than vowels, at least in English (Denes, 1963). It is also known 
that in the case of vowels, decreases in duration due to speaking 
rate increases result in progressively more undershoot of hypo- 
thetical target values inferred from careful speech (Moll and 
Shriner, 1967). It is therefore natural to suppose that if the
articulation of consonants followed the same rules as the articu- 
lation of vowels, the increased rate would result in increased 
undershoot. However, whereas increases in amount of undershoot 
result in only quantitative changes in the acoustic signal for
vowels - namely changes in formant frequencies - increases in 
undershoot could result in qualitative changes in the signal for 
consonants and these changes might sometimes be great enough to
make the consonant apparently shift its manner class which could 
be highly misleading to the listener. For example, an undershot 
stop might generate friction and undershot fricatives might appear 
as stops if voiceless and glides if voiced. For this perceptual 
reason, that is, because of the reception problems that would result
from consonants behaving like vowels with rate increase, the pro- 
duction system may impose more restraint on undershoot in con- 
sonants than in vowels. If true, this suggests that undershoot 
is not simply a result of built-in neuromechanical inertia of the 
peripheral system but, at least to some extent, contingent on the 
degree of temporo-spatial control chosen by the production system. 

In fact it is quite possible that none of the undershoot we 
see is due to a neuromechanical ceiling effect. Rather, the pro- 
duction system may simply be taking advantage of the demonstrated 
capacity of the perceptual system to extrapolate from the acousti- 
cal correlates of undershoot when it needs to decode the segmental 
structure of the message (Lindblom and Studdert-Kennedy, 1967).

I would like to describe the constraints on vowels and con- 
sonants that I have just discussed in terms of the concept of 
Tolerance Rules. I would like to suggest that along with target
specifications for vowels and consonants the production system
specifies tolerances allowed in the approximation to these targets,
and that these tolerances are less generous for consonants than
for vowels. The concept of tolerance rules has at least been
hinted at in some speech synthesis models, and has some affinity 
to Stevens' ideas about the quantal nature of speech (Stevens,
1972). He has suggested that languages choose segments, the arti- 
culation of which allows maximal tolerance, in that the undesirable 
acoustical consequences of imprecise articulation are relatively 
minor. This can be, in some sense, viewed as a diachronic view 
of tolerances. I am suggesting that tolerances for consonants are 
in general less than those for vowels and that this fact is repre- 
sented synchronically in the means of control of articulation, and 
particularly in segment duration rules. 
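
One way to see what a tolerance on target approximation would
demand in time is a worked sketch. The Python below assumes, purely
for illustration, that an articulator closes on its target
exponentially with a fixed time constant. That kinematic
description, the function names, and every numerical value are
hypothetical and do not come from the studies cited; the sketch
also takes no position on whether undershoot reflects a mechanical
limit or a chosen degree of control. Under the stated assumption,
a less generous tolerance translates directly into a longer
minimum movement time.

```python
import math


def undershoot(distance, duration, time_constant):
    """Distance still separating the articulator from its target after
    `duration`, assuming an exponential approach (an illustrative choice)."""
    return distance * math.exp(-duration / time_constant)


def minimum_duration(distance, tolerance, time_constant):
    """Shortest duration that keeps undershoot within `tolerance`."""
    if tolerance >= distance:
        return 0.0  # already within tolerance before any movement
    return time_constant * math.log(distance / tolerance)


# Hypothetical values: 10 mm of travel and a 40 ms time constant, with a
# generous tolerance standing in for a vowel and a strict one for a consonant.
DISTANCE_MM = 10.0
TIME_CONSTANT_MS = 40.0
vowel_min = minimum_duration(
    DISTANCE_MM, tolerance=3.0, time_constant=TIME_CONSTANT_MS
)
consonant_min = minimum_duration(
    DISTANCE_MM, tolerance=0.5, time_constant=TIME_CONSTANT_MS
)
print(round(vowel_min, 1), round(consonant_min, 1))
```

In this toy setting the segment with the stricter tolerance has
less duration to give up, which is the relationship the tolerance
rule concept is meant to capture.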

Some evidence that is available about rates of articulator
movement suggests that at least with regard to speaking rate these
proposed tolerance rules are formulated primarily in terms of time
constraints on segment durations rather than in terms of
movement-rate constraints. Studies of articulators during what I would
class as a movement from one target to another suggest that when 
speaking rate increases there is proportionately more change in 
the duration of a movement than of its average velocity (MacNeilage 
and DeClerk, 1968). Furthermore, I am not aware of any good evi- 
dence that rates of articulator movement are in general greater 
in movements from vowel to consonant than from consonant to vowel 
targets.

It does seem to be true that there are differences in rates 
of articulator movement associated with different consonantal 
categories. A number of people have shown that the lip-jaw complex
moves at a faster rate from vowels to voiceless bilabial stops 
than to voiced (MacNeilage, 1972). There is also some evidence
that rate of articulator movement from vowels to fricatives is 
slower than to stops (Kim, 1972; Kent and Moll, 1972). It is 
possible that these differences are due, in the case of stop
consonant voicing, to 'compensation' for differences in aerodynamic
forces against the area of occlusion (Ohman, 1967) and, in the case
of the fricative-stop difference, to the necessity for greater
precision of articulator positioning to produce adequate frication
(Stevens and House, 1963). If so, then these appear to be cases
where differences in tolerance values do affect rates of
articulator movement. It is of interest in the light of what I said
about time tolerances being greater for vowels than consonants that
the 'price' the system pays in time for the slower movement rates
for approaches to voiced stops and fricatives is apparently paid by
the vowel. This is especially well established, in several
languages, for the voicing related difference in vowel duration
preceding stops (Chen, 1970).

One reason for arguing that there are rather strict time- 
tolerance rules for consonants comes from a study that Jaemin Kim 
and I have done on durations of occlusion of intervocalic stop 
consonants in citation form as a function of the degree of
openness of the adjacent vowels (Kim and MacNeilage, 1972). Slide 3
shows the results of this study in terms of closure durations of
8 subjects. It seems quite clear from these results that the
distance from target-to-target either in the VC or the CV
transition had negligible effects on duration of occlusion.

The main advantage of a system that has segment-specific
tolerance rules for duration of activation of articulators moving 
toward targets is that it makes possible the kind of versatile
"digital-to-analog" conversion that appears to occur in the
periphery (using this analogy loosely). It appears that the input
must be in terms of invariants, and I am guessing that these are
targets. We know the output is characterized by its variability and
I suspect that simple time-related neuromuscular ceiling effects
aren't the cause. I am suggesting that the variability in target
approximation is due to tolerance rules that have a base value for 
individual segments in careful speech and differing constants pro- 
portional to their base value that can produce a continuum of 
'degradation' for each sentence as targeting demands are decreased.
Thus speaking rate can assume a number of values on a continuum,
and given this input, the tolerance constants of segments give
as output their corresponding values of approximation to target.²
For example, Kim has observed that fricatives reduce in duration 
less in unstressed syllables, relative to stressed, than stops, 
suggesting that they have more restrictive tolerance rules (Kim, 
1972). 
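
The proposal in this paragraph can be sketched numerically. The
Python below assumes a linear rule in which each segment class
gives up a fixed fraction of its careful-speech duration as
targeting demands are relaxed. The linear form, the names, and all
of the numbers are hypothetical; they serve only to show how one
continuous input, such as speaking rate, could yield
segment-specific outputs.

```python
# Hypothetical base durations (ms) in careful speech and tolerance
# constants: the fraction of the base duration a segment class may give up
# when targeting demands are fully relaxed.
SEGMENT_CLASSES = {
    "vowel": {"base_ms": 200.0, "tolerance": 0.5},
    "stop": {"base_ms": 100.0, "tolerance": 0.3},
    "fricative": {"base_ms": 120.0, "tolerance": 0.2},
}


def duration_ms(segment_class, relaxation):
    """Duration of a segment for a given relaxation of targeting demands.

    `relaxation` runs from 0.0 (careful speech) to 1.0 (most reduced).
    """
    if not 0.0 <= relaxation <= 1.0:
        raise ValueError("relaxation must lie between 0.0 and 1.0")
    spec = SEGMENT_CLASSES[segment_class]
    return spec["base_ms"] * (1.0 - spec["tolerance"] * relaxation)


def retained_fraction(segment_class, relaxation):
    """Share of the careful-speech duration that survives."""
    base = SEGMENT_CLASSES[segment_class]["base_ms"]
    return duration_ms(segment_class, relaxation) / base


for name in SEGMENT_CLASSES:
    print(name, [round(duration_ms(name, r), 1) for r in (0.0, 0.5, 1.0)])
```

The ordering of the illustrative constants mirrors the claims in
the text: vowels are allowed to shorten proportionately more than
consonants, and fricatives less than stops.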

Despite the problems that I know it has, thinking in terms
of targets and tolerance rules may be a way of coming to grips
with the invariance paradox at the necessary level, namely at
the level of articulatory dynamics. One thing it offers is a way
of coming to grips with the whole range of real speech behavior,
including style shifting, and not just with citation form speech.

²Related rules may apply to stress in English but in this case
rate of articulator movement may be more variable than with
speaking rate changes.

I will conclude by summarizing the message. Although we must 
acknowledge the relevance of various linguistic units for speech 
production theory and strive to learn more about them, we can't
afford to stop there. Instead we must try to integrate ideas 
about linguistic units with hypotheses based on the dynamic pro- 
perties of the peripheral system, and a conceptual schema based 
on targets and tolerance rules is an attempt in that direction. 
Finally, I hope you haven't found this paper to be characterized 
by the misprinted phrase in the printed abstract of the paper 
which says "A speaker's output is a continuous strain."

Slide Captions

Slide 1: Schematic view of the organization of speech production.
(From MacNeilage and MacNeilage, 1973.)

Slide 2: (Not reproduced here.) Multicolored Easter eggs.

Slide 3: Average durations of consonantal occlusion for 8 subjects
producing four consonants in four intervocalic environments.
(From Kim, 1972.)

SCHEMATIC VIEW OF THE ORGANIZATION OF SPEECH PRODUCTION